Sparse Clustering Overview

Paul Harmon

August 29, 2018



Introduction

This document covers some of the high points of sparse clustering and its implementation in R using the sparcl package.

A note: the sparcl package was removed from CRAN on 7/20/2018 “as check problems were not corrected despite reminders” (see https://cran.r-project.org/web/packages/sparcl/index.html). However, an alternative implementation is available in the RSKC package by Kondo, Salibian-Barrera and Zamar (2016).

How does Sparse Clustering Work?
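
As a rough illustration of the idea, sparse K-means in the formulation of Witten and Tibshirani (2010), which sparcl implements, gives each feature a non-negative weight \(w_j\) and alternates between two steps: ordinary K-means on the weighted features, and a weight update that maximizes \(\sum_j w_j \mathrm{BCSS}_j\) (the weighted between-cluster sums of squares) subject to \(\|w\|_2 \le 1\) and \(\|w\|_1 \le s\). A small bound \(s\) pushes most weights to exactly zero. The base-R sketch below follows that recipe on simulated data. The function names (soft_threshold, feature_bcss, update_weights, sparse_kmeans_sketch), the simulated data, and the choice s = 1.2 are illustrative assumptions, not the sparcl or RSKC code.

```r
# Illustrative sketch only: hypothetical helper names, not the sparcl/RSKC implementation.

# Soft-thresholding: shrink each value toward zero by delta, truncating at zero
soft_threshold <- function(a, delta) sign(a) * pmax(abs(a) - delta, 0)

# Between-cluster sum of squares for each feature (total SS minus within-cluster SS)
feature_bcss <- function(x, labels) {
  sapply(seq_len(ncol(x)), function(j) {
    total_ss <- sum((x[, j] - mean(x[, j]))^2)
    within_ss <- sum(tapply(x[, j], labels, function(v) sum((v - mean(v))^2)))
    total_ss - within_ss
  })
}

# Weight update: unit L2 norm and L1 norm at most s, found by bisection on delta
update_weights <- function(bcss, s) {
  a <- pmax(bcss, 0)
  w <- a / sqrt(sum(a^2))
  if (sum(w) <= s) return(w)                        # bound already satisfied
  best <- as.numeric(seq_along(a) == which.max(a))  # fallback: single top feature
  lo <- 0
  hi <- max(a)
  for (i in seq_len(40)) {
    delta <- (lo + hi) / 2
    shrunk <- soft_threshold(a, delta)
    w <- shrunk / sqrt(sum(shrunk^2))
    if (sum(w) > s) {
      lo <- delta          # still too dense: shrink harder
    } else {
      hi <- delta          # feasible: remember it and try shrinking less
      best <- w
    }
  }
  best
}

# Alternate K-means on weighted features with the weight update
sparse_kmeans_sketch <- function(x, k, s, n_iter = 5, nstart = 10) {
  x <- as.matrix(x)
  p <- ncol(x)
  w <- rep(1 / sqrt(p), p)                          # start with equal weights
  for (iter in seq_len(n_iter)) {
    keep <- w > 0
    # Scaling column j by sqrt(w_j) makes plain K-means minimize the weighted WCSS
    xw <- sweep(x[, keep, drop = FALSE], 2, sqrt(w[keep]), `*`)
    labels <- kmeans(xw, centers = k, nstart = nstart)$cluster
    w <- update_weights(feature_bcss(x, labels), s)
  }
  list(labels = labels, weights = w)
}

# Simulated data: 2 informative features (two shifted groups) plus 8 pure-noise features
set.seed(1)
n <- 60
informative <- rbind(matrix(rnorm(n, mean = -3), ncol = 2),
                     matrix(rnorm(n, mean = 3), ncol = 2))
noise <- matrix(rnorm(n * 8), ncol = 8)
x <- cbind(informative, noise)

fit <- sparse_kmeans_sketch(x, k = 2, s = 1.2)
round(fit$weights, 3)   # features with zero weight are dropped from the clustering
```

In this sketch the L1 bound plays the same role as the L1 argument of RSKC used later in the document: the tighter the bound, the fewer features keep a nonzero weight.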

Implementations in R:

# Install (if needed) and load the RSKC package
#install.packages('RSKC')
library(RSKC)
## Loading required package: flexclust
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
# Install the sparcl package (archived on CRAN, so a plain install.packages('sparcl') may fail)
# Option 1: download the most recent archived tarball, then install it from the local file (e.g. via RStudio)
#install.packages("C:/Users/r74t532/Downloads/sparcl_1.0.3.tar.gz", repos = NULL, type = "source")
# Option 2: install the archived version by version number
#devtools::install_version('sparcl',version = '1.0.3')

Datasets

NBA Players

nba <- read.csv('data/na.csv', header = TRUE)

# PCA on all columns except the first three, centered and scaled
pc1 <- prcomp(nba[,-c(1,2,3)], scale. = TRUE, center = TRUE)
summary(pc1)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.0615 1.9545 1.27386 1.21837 0.98520 0.90297
## Proportion of Variance 0.4687 0.1910 0.08114 0.07422 0.04853 0.04077
## Cumulative Proportion  0.4687 0.6596 0.74079 0.81501 0.86354 0.90431
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.73939 0.64388 0.57694 0.44294 0.43444 0.33548
## Proportion of Variance 0.02734 0.02073 0.01664 0.00981 0.00944 0.00563
## Cumulative Proportion  0.93164 0.95237 0.96902 0.97883 0.98826 0.99389
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.26188 0.17055 0.11187 0.07452 0.05996 0.04939
## Proportion of Variance 0.00343 0.00145 0.00063 0.00028 0.00018 0.00012
## Cumulative Proportion  0.99732 0.99877 0.99940 0.99968 0.99986 0.99998
##                           PC19    PC20
## Standard deviation     0.01562 0.01345
## Proportion of Variance 0.00001 0.00001
## Cumulative Proportion  0.99999 1.00000
# Model-based clustering (mclust) on the first two principal components
library(mclust)
## Package 'mclust' version 5.4.1
## Type 'citation("mclust")' for citing this R package in publications.
modclust <- mclustBIC(pc1$x[,1:2])
mc <- Mclust(pc1$x[,1:2], x = modclust)

# Build a data frame of the first two PC scores and the mclust classification
library(tibble)
library(ggplot2)
library(plotly)
dat1 <- tibble(s1 = pc1$x[,1], s2 = pc1$x[,2], class = mc$classification)

#build a plot
p <- ggplot(dat1) + geom_point(aes(s1,s2,color = factor(class))) + ggtitle("NBA Rookies 2017") + theme_classic()

ggplotly(p)
##################
##Compare the Sparse Clustering Methods to this: 
library(RSKC)
spk <- RSKC(nba[,-c(1,2,3)], ncl = 3, alpha = 0, L1 = 1) # Sparse K-Means
# Per the RSKC documentation, alpha = 0 with an L1 bound (here L1 = 1) gives sparse K-means
# For "robust" sparse K-means, set alpha > 0 (the proportion of cases trimmed) and keep the L1 bound

rspk <- RSKC(nba[,-c(1,2,3)], ncl = 3, alpha = .5, L1 = 1) # Robust Sparse K-Means

dat1$spk <- spk$labels
dat1$rspk <- rspk$labels

#gives the sparse k-means 
p2 <- ggplot(dat1) + geom_point(aes(s1,s2,color = factor(spk))) + ggtitle("NBA Rookies 2017") + theme_classic()
ggplotly(p2)
#gives the robust sparse k-means
p3 <- ggplot(dat1) + geom_point(aes(s1,s2,color = factor(rspk))) + ggtitle("NBA Rookies 2017") + theme_classic()
ggplotly(p3)
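
The plots compare the three clusterings visually. A cross-tabulation of the label vectors gives a numeric complement. The sketch below shows the idea on the built-in iris data; the two clusterings used (K-means and Ward hierarchical clustering) are stand-ins chosen only so the example runs without the NBA file or the RSKC package.

```r
# Stand-in data and clusterings (assumptions for illustration only)
set.seed(42)
x <- scale(iris[, 1:4])

labels_a <- kmeans(x, centers = 3, nstart = 25)$cluster
labels_b <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# Rows: K-means clusters; columns: hierarchical clusters
agreement <- table(kmeans = labels_a, hclust = labels_b)
print(agreement)

# Share of observations that fall in the dominant cell of each row
dominant_share <- sum(apply(agreement, 1, max)) / sum(agreement)
dominant_share
```

With the objects from this document, the analogous call would be table(mclust = dat1$class, sparse = dat1$spk). Cluster numbers are arbitrary labels, so agreement shows up as one dominant cell per row rather than as a heavy diagonal.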

Carnegie Classifications

In this case, we have \(n = 335\) institutions with 8 different characteristics.